home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
Carousel Volume 2 #1
/
carousel.iso
/
mactosh
/
appl
/
browser.sit
/
Browser tech note
< prev
next >
Wrap
Text File
|
1987-05-14
|
8KB
|
173 lines
Technical note on the Browser - 1987 April 12 - Updated 1987 Apr 29 & May 14
by Mark^Zimmermann
9511 Gwyndale Drive
Silver Spring, MD 20910
301-565-2166
science@@nems.arpa - 75066,2044 on CompuServe
This note is meant to describe briefly the design philosophy and
algorithms used in Browser (a.k.a. Tiny Browser, Recall, Boxcars, etc.)
as of today. For additional details, contact me at the above
addresses, or consult the source code of the program itself.
OVERVIEW
The Browser goal is to provide a user with an exceedingly
simple, exceeding fast environment in which to be exceedingly
productive in working with very large (many megabytes) but very
disorganized files of free-text information. In order to be useful
in the real world, the program must be up and running (not just a
research project) this year. It must run on real hardware,
available to real people, not on systems too expensive,
experimental, or unreliable to use. It must not require its users
to go through gyrations in order to get data into or out of the
program. It must not assume that data comes in neat, tidy,
well-formatted bundles.
Ideally, the Browser should run on any piece of hardware
that is available. Practically, the first system that it is being
developed on is the Apple Macintosh, because I have one and because
it has an excellent user interface and programming environment for
developing the Browser. A near-term goal is to transport some
version of the program to the IBM-PC and compatibles, and to the
Sun Workstation. Longer term goals are to make all or part of it
run on mainframe IBM computers, Cray supercomputers, etc.
The program at present is written in MacForth Plus and runs on
any Macintosh system with 512KB of memory or more, including the
Mac II. It consists of about 2,000 lines of Forth code, heavily
commented. Run the program to get an idea of what features have
been implemented and what the user interface looks like ... they
change from week to week as I put more things in.
DATA STRUCTURES
In keeping with the spirit of the program, as well as the
simple-mindedness of the programmer, there is really only one data
structure used in the Browser, and it is highly primitive.
Above all, locality of reference is the Holy Grail -- with the idea
of enhancing the program's speed when running with devices such as
optical disks, which currently have rather slow rates for moving
their heads around on the media. Side-effects of the simple
approach are good execution speed, simplicity in sorting and
accessing the data, and ease of debugging. The main penalty is
wasted disk storage space -- a penalty that is becoming
increasingly irrelevant as big, fast mass media become cheaper
every day.
The inputs to the Browser are two types of files: text
files and index files. Text files consist of ASCII characters, as
usual. An index file consists of index records of 16 bytes each:
- 4 bytes of pointer into the text file being indexed, and
- 12 bytes of index key
The index key is extracted from the text file beginning at the
pointer value, turned into all capital letters, and filled out with
blanks from the end of the word indexed to the end of the 12 bytes.
These 16-byte index records are then simply arranged in
alphabetical order, sorted by their index key field, in a linear
array -- the index file.
For a specific example, if the input file consisted of the
words "Now is the time now" then the index file would look like:
4IS 0NOW 16NOW 7THE 11TIME
where there is obviously a lot of wasted blank space.
The recent addition of a Subindex search capability, described
below, arguably adds another data structure to the Browser; see the
discussion in that section for details.
ALGORITHMS
Similar to the discussion of data structures above, there
is only one algorithm that deserves the name in the current
Browser, and that algorithm is many decades old and very simple.
In order to sort the index of words into alphabetical order, the
program executes a classical radix sort.
This is the same sort used on old punched-card sorters.
Starting with the rightmost of the 12 columns of the index key
field, we take the index records and throw them into bins
corresponding to the letter in that column. We take the contents
of those bins and stack them together, and repeat the process, this
time using the letter in column 11 to determine which bin each card
goes into. Etc., etc., until the final pass on column 1 ends with
all of the cards, I mean index records, sorted. Try it out by
hand to get an idea of how it works, or read about it in any
standard textbook.
The radix sort has many advantages besides its simplicity. It
takes time that grows linearly with the number of index records to
be sorted -- thus, there is no logarithmically-increasing speed
penalty. It is a stable sort, in that identical words will end up
in the index appearing in the same order in which they appeared in
the original file. (That is convenient for the user who wants to
browse sequentially through the data.) The radix sort vectorizes
well, an advantage when sorting on supercomputers or other
vector-oriented machines. Additionally, the radix sort can be used
with a very simple data buffering structure, since information is
read in strictly sequentially from mass storage, and is moved out
in simple, repetitive patterns.
The simple radix sort does suffer from one disadvantage: it
requires free space on the disk equal in size to the file being
sorted. That space is used for the storage of the "bins" into
which items are thrown during the 12 sorting passes, and it is free
again at the end of the sort. I have not found that requirement
for space to be a significant problem. (In most cases, a user
would like to keep some space free for later data accumulation,
note-taking, etc., and so it seems unlikely that a file that
occupied all of free disk space would ever need to be sorted.)
Other than the radix sort, the only other computational
technique that comes to mind as being used in the Browser is
the binary search, in order to respond to a user's keystrokes in
the Index Window display. Does that count as an "algorithm"?
SUBINDEX IMPLEMENTATION
Versions of Browser after v.240 have a new capability: Subindex
searching. If a user wants to find, for example, all occurrences of
the words "computer" or "computers" near the word "Apple", it is now
easy to do so without being forced to visually scan through a large
number of irrelevant context items.
The way I have implemented that is to use a simple linear
array of flag bits, one for every 32 characters (bytes) in
the original text file. If a flag bit is set to 1, then that
region of the database is "valid"; if 0, it is "invalid". The
Index Window now counts and displays occurrences of indexed words
in valid regions of the data file. That information is shown along
with the total number of occurrences of items in the database (including
valid and invalid regions). A user can clear ("Empty") out the working
subindex, can set it to equal the entire database ("Fill"), and can
take the complement of the subindex ("Invert", a boolean NOT operation).
The bit array for the new Index window starts out all 0's. As a
user selects items by shift-clicking on them, flag bits around the
terms selected are set to 1's. The distance on each side of each
selected term to set bits to 1's is a user option; for close
proximity, only a few bits are be set; for more remotely
linked terms, many bits are set. The flag array can be
held completely in memory, since it is a factor of 256 smaller than
the original database file. So, operations on it are very fast.
CONCLUSIONS
Ultimately, the Browser should be a natural extension of
a user's memory. It thus must be both fast and simple. In order
to be used, it must also be cheap, both in acquisition cost and in
the cost of getting data into and out of it. With a simple enough
programming philosophy, and modern hardware and systems software,
that goal should achievable.
"Make everything as simple as possible, but no simpler."
- A. Einstein